[ORT 1.17.0 Release] Cherry pick 1st round #19243
Conversation
…19182)

### Description
Extends code coverage to the Entropy, Histogram, and Distribution calibration methods, and fixes bugs found while doing so.

### Motivation and Context
Bugs detected in [Olive](https://github.com/microsoft/OLive).
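A minimal sketch of selecting one of these calibration methods through the static quantization API; the data reader, model paths, and input shape below are placeholders, not part of this PR.

```python
# Sketch: choosing a calibration method for static quantization.
# The random data reader is a stand-in; a real reader yields actual model inputs.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, CalibrationMethod, quantize_static

class RandomDataReader(CalibrationDataReader):
    def __init__(self, input_name="input", shape=(1, 3, 224, 224), n=8):
        self._data = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)} for _ in range(n)]
        )

    def get_next(self):
        return next(self._data, None)

quantize_static(
    "model.onnx",
    "model.quant.onnx",
    RandomDataReader(),
    calibrate_method=CalibrationMethod.Entropy,  # Percentile / Distribution also available
)
```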
### Description
Allow the proxy to load models with 1 GB <= size < 2 GB.

Resolves #19157.
…ry (#19174)

### Description
Check the ep_cache_context node property of EPContext nodes, and don't allow relative paths like "../file_path".
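A minimal sketch of the kind of check described; the function name and logic are illustrative only (the real check lives in the EP's C++ code): reject any cached-context path that escapes the context model's directory.

```python
# Illustrative path check: reject absolute paths and anything that resolves
# outside the context model directory, e.g. "../file_path".
from pathlib import Path

def is_allowed_cache_path(context_model_dir: str, ep_cache_context: str) -> bool:
    if Path(ep_cache_context).is_absolute():
        return False
    base = Path(context_model_dir).resolve()
    candidate = (base / ep_cache_context).resolve()
    return candidate.is_relative_to(base)

assert not is_allowed_cache_path("/models/ctx", "../file_path")
assert is_allowed_cache_path("/models/ctx", "engine_dir/model.engine")
```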
### Description
1. Make the JBLAS code an external module of ORT.
2. Move the q4 gemm code to contrib_ops.
3. Update the template kernel library to the v0.1 release.

### Motivation and Context
We found that the current LLM model performance is far below our expectations. Here is some performance data collected on the Mistral-7B model with a Xeon-8480:

| 8 threads | prompt length=32 past_len=32 | prompt length=1 past_len=32 |
| -- | -- | -- |
| ORT-main | 1220ms | 263ms |
| Neural-speed | 564ms | 87ms |
| ORT-this PR | 597ms | 120ms |

Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code, there is a 33ms (87ms vs. 120ms) latency gap between the two frameworks. Through some statistical analysis, the summed latency of `MatMulNBits` is 86.7ms, while the summed latency of all int4 GEMMs in `Neural-speed` is 84.8ms, so other OPs introduce an extra 30ms of latency. The performance of MatMulNBits in this PR meets our expectations.

### Remaining Issues
1. For hybrid CPUs, like the Core 12900K, the ONNXRuntime thread pool uses TaskGranularityFactor to scale its number of threads. This is not expected in our code design and may slow down hybrid CPU performance by 30~40%.
2. Prepack uses a single thread, which makes session initialization very slow.
3. MatMulNBits with zero points will fall through to COMP_FP32 even when accuracy_level=4. Our COMP_INT8 IGemmCore path with zero points is not optimized for now and will be updated in the future. So, for an int4 model with zero points, there is no difference whether accuracy_level is 0 or 4.
### Description
Upgrade package versions.

```
# npm audit report

electron  23.0.0-alpha.1 - 23.3.13
Severity: moderate
ASAR Integrity bypass via filetype confusion in electron - GHSA-7m48-wc93-9g85
fix available via `npm audit fix --force`
Will install [email protected], which is a breaking change
node_modules/electron

get-func-name  <2.0.1
Severity: high
Chaijs/get-func-name vulnerable to ReDoS - GHSA-4q6p-r6v2-jvc5
fix available via `npm audit fix`
node_modules/get-func-name

semver  <=5.7.1 || 6.0.0 - 6.3.0 || 7.0.0 - 7.5.1
Severity: moderate
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - GHSA-c2qf-rxjj-qqgw
fix available via `npm audit fix`
node_modules/cross-spawn/node_modules/semver
node_modules/global-agent/node_modules/semver
node_modules/semver
```
### Description
This PR updates the LLaMA-2 attention fusions by adding the following:
- Loading the PyTorch model from Hugging Face with the `LlamaAttention` class before exporting
- Updating the attention mask pattern matching to support another case

This PR also fixes [this issue](#19040).

### Motivation and Context
Recent changes to Hugging Face's `transformers` library break the existing pattern matching. Since the attention fusions aim to change the graph from `LayerNorm Op --> Set of Attention Nodes --> LayerNorm Op` to `LayerNorm Op --> Attention Op --> LayerNorm Op` per layer, it ultimately does not matter which nodes comprise the `Set of Attention Nodes`, because they will all be removed and replaced by the `Attention Op` in the end. Therefore, it does not matter whether the `LlamaAttention` class or a different attention class is used to load the PyTorch model before exporting, because the expected graphs after the attention fusions will look identical no matter which attention class is chosen. By loading the PyTorch model with the `LlamaAttention` class instead of other attention classes (e.g. `LlamaFlashAttention2` or `LlamaSdpaAttention`) and then exporting it to ONNX, the existing pattern matching will continue to work.
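A minimal sketch of loading the model so that the eager `LlamaAttention` class is used before export; it assumes a `transformers` version where `from_pretrained` accepts `attn_implementation`, and the model name is a placeholder.

```python
# Sketch: force the eager LlamaAttention implementation before ONNX export.
# "meta-llama/Llama-2-7b-hf" is illustrative only.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float32,
    attn_implementation="eager",  # selects LlamaAttention rather than SDPA/FlashAttention2
)
model.eval()
# The actual export is done by the ORT LLaMA tooling; the point here is only
# that the model is loaded with the eager attention class before exporting.
```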
… is not guaranteed (#19195)

Fix the issue that the generated context cache model's inputs/outputs order is not guaranteed.

### Description
Currently, QNN EP generates the context cache model in the Compile() method, which only has access to the partitioned graph, and the inputs/outputs order of the partitioned graph is not guaranteed. The EP also does not have a view of the user's input model. We have to move the context cache model generation to a higher level, in GraphPartitioner, which has a view of the partitioned model. This is also a breakdown of the PR for multi-partition support. #18865
…der options (#19154)

Several changes:
1. To align with other EPs' setting of EP context configs in session options, for example [QNN EP](#18877), EP context configs for TRT EP can be configured through:
   1. Session options: `ep.context_enable`, `ep.context_file_path` and `ep.context_embed_mode`
   2. Provider options: `trt_dump_ep_context_model`, `trt_ep_context_file_path` and `trt_dump_ep_context_embed_mode`
   3. The above settings have a 1:1 mapping, and provider options have higher priority over session options.

   ```
   Please note that there are rules for using the following context model related provider options:
   1. In the case of dumping the context model and loading the context model, for security reasons, TRT EP doesn't allow the "ep_cache_context" node attribute of the EP context node to be an absolute path or a relative path that is outside of the context model directory. This means the engine cache needs to be in the same directory or a sub-directory of the context model.
   2. In the case of dumping the context model, the engine cache path will be changed to a path relative to the context model directory. For example, if "trt_dump_ep_context_model" and "trt_engine_cache_enable" are enabled and "trt_ep_context_file_path" is "./context_model_dir":
      - if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir"
      - if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir"
   ```
2. The user can decide the naming of the dumped "EP context" model by using `trt_ep_context_file_path`; please see GetCtxModelPath() for more details.
3. Added suggested comments from #18217
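A minimal sketch of setting these configs from the Python API; the option names come from this PR, while the model path and the example values are placeholders.

```python
# Sketch: configuring EP context dumping for TRT EP.
# "model.onnx" and the directory/values below are illustrative only.
import onnxruntime as ort

so = ort.SessionOptions()
# Session-option style (shared with other EPs such as QNN EP):
so.add_session_config_entry("ep.context_enable", "1")
so.add_session_config_entry("ep.context_file_path", "./context_model_dir/model_ctx.onnx")
so.add_session_config_entry("ep.context_embed_mode", "0")

# Provider-option style (TRT-specific; takes priority over the session options above):
trt_options = {
    "trt_engine_cache_enable": "1",
    "trt_dump_ep_context_model": "1",
    "trt_ep_context_file_path": "./context_model_dir",
    "trt_dump_ep_context_embed_mode": "0",
}

session = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=[("TensorrtExecutionProvider", trt_options), "CPUExecutionProvider"],
)
```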
### Description
1. Support the causal mask in MHA CPU.
2. Support a custom rotary_dim in rotary_emb.
3. Add bf16 support for rotary_emb.
4. Fix a bug in the attention rotary path.
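A minimal numpy sketch of what a custom `rotary_dim` means for rotary embeddings; the function, shapes, and interleaving convention are illustrative, not the kernel's actual code: only the first `rotary_dim` channels of each head are rotated, the rest pass through unchanged.

```python
# Illustrative partial rotary embedding: rotate only the first rotary_dim channels.
import numpy as np

def partial_rotary(x, positions, rotary_dim, base=10000.0):
    # x: (seq_len, num_heads, head_size)
    rot, rest = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))      # (half,)
    angles = positions[:, None] * inv_freq[None, :]          # (seq_len, half)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)

x = np.random.randn(4, 8, 64).astype(np.float32)
out = partial_rotary(x, positions=np.arange(4), rotary_dim=32)
print(out.shape)  # (4, 8, 64)
```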
### Description
Adds the following session options to configure the device:
- `soc_model`: The SoC model number. Refer to the QNN SDK documentation for valid values. Defaults to "0" (unknown).
- `htp_arch`: The minimum HTP architecture the driver will use to select compatible QNN operators.
- `device_id`: The ID of the device to use when setting 'htp_arch'. Defaults to "0" (for single device).

### Motivation and Context
Allow more configuration.
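A minimal sketch of passing these options when creating a session with QNN EP from Python; the option names come from this PR, while `backend_path`, the example values, and passing them as provider options are assumptions of the sketch.

```python
# Sketch: configuring the QNN device options.
# The values "60", "73", "0" and the model path are placeholders.
import onnxruntime as ort

qnn_options = {
    "backend_path": "QnnHtp.dll",  # HTP backend library (assumed)
    "soc_model": "60",             # SoC model number; see the QNN SDK docs for valid values
    "htp_arch": "73",              # minimum HTP architecture to target
    "device_id": "0",              # device to use when htp_arch is set
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", qnn_options), "CPUExecutionProvider"],
)
```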
…oat16 (#17031)

### Description
This PR adds the SbgemmKernel for aarch64. This includes the Sbgemm kernel, which implements matrix multiplication with bfloat16 SIMD instructions (bfmmla), and MatMul operator changes to invoke the Sbgemm kernel. To enable the Sbgemm kernel, set the following session option: "kOrtSessionOptionsGemmFastMathMode". The PR also adds new test cases for mlas and ort.

### Motivation and Context
This is to improve MatMul performance on the aarch64 platform. I have run the benchmarking script below (bert, roberta and gpt2 model inference) on an AWS Graviton3 based c7g.4xl instance and observed a 1.2x - 1.76x performance improvement compared to the sgemm (fp32) kernel performance.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
And the unit test precision results match the sgemm kernel results.

`./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync`
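A minimal sketch of enabling this fast-math mode from Python. The exact config key string behind `kOrtSessionOptionsGemmFastMathMode` is an assumption here and should be verified against the constant's definition in the ORT session-options config key header; the model path is a placeholder.

```python
# Sketch: enabling the bfloat16 fast-math GEMM path on aarch64.
# The config key string is assumed from kOrtSessionOptionsGemmFastMathMode;
# verify it against onnxruntime_session_options_config_keys.h.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

session = ort.InferenceSession("model.onnx", sess_options=so,
                               providers=["CPUExecutionProvider"])
```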
### Description
Adds a job to create a nightly python package for ORT/QNN on Windows ARM64. Must build onnxruntime-qnn with python 3.11 and numpy 1.25.

**Note: pipeline run may take up to 3 hrs**

### Motivation and Context
Make it possible to get a nightly python package with the latest updates to QNN EP. Issue #19161
### Description
Update unet fusion for the [stable diffusion webui extension](https://github.com/tianleiwu/Stable-Diffusion-WebUI-OnnxRuntime):
(1) Update the fusion pattern to support fp16 unet models.
(2) Add a progress bar.
(3) Use a cached map to speed up dtype or shape lookups in the shape inference result.
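A minimal sketch of the caching idea in item (3); the class and method names are illustrative, not the fusion code's actual API: build the name-to-value_info map once instead of scanning the shape-inference result on every lookup.

```python
# Illustrative memoized lookup over an ONNX shape-inference result.
# "unet.onnx" and "some_tensor" are placeholders.
import onnx

class ShapeInfoCache:
    def __init__(self, inferred_model: onnx.ModelProto):
        graph = inferred_model.graph
        # One pass to index every value_info/input/output by tensor name.
        self._by_name = {
            vi.name: vi
            for vi in list(graph.value_info) + list(graph.input) + list(graph.output)
        }

    def dtype(self, name: str):
        vi = self._by_name.get(name)
        return vi.type.tensor_type.elem_type if vi is not None else None

    def shape(self, name: str):
        vi = self._by_name.get(name)
        if vi is None:
            return None
        return [d.dim_value if d.HasField("dim_value") else d.dim_param
                for d in vi.type.tensor_type.shape.dim]

model = onnx.shape_inference.infer_shapes(onnx.load("unet.onnx"))
cache = ShapeInfoCache(model)
print(cache.dtype("some_tensor"), cache.shape("some_tensor"))
```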
### Description
Add BuildArch.

To verify: https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=400952&view=logs&j=5b022bb4-70a7-5401-8766-a8a7802c7150&t=291e85c7-5547-590b-50de-4e01fcd4eba3&l=14
I suggest holding off on it for a moment, since this PR will add a new dependency, neural-speed, to ONNX Runtime, and I have some concerns about it. I've just sent an email to the author who added this component, and another email to a few PMs to discuss it. I do not have any objection to adding the dependency, but there are some details that need to be figured out. Please give me a few days to complete the work.
ok, noticed.
### Description
Remove old python files.

### Motivation and Context
We have a new op, MatMulNBits, and this one is deprecated.
…age (#19251)

C.register_tensorrt_plugins_as_custom_ops() is only available in the GPU python package. Add a condition to avoid calling it in the CPU python package.
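A minimal sketch of the kind of guard described; the wrapper function and the exact check are illustrative (the real change lives in the ORT python bindings):

```python
# Illustrative guard: only register TRT plugins when the binding actually
# exposes the call, i.e. in the GPU package. The wrapper name is hypothetical.
from onnxruntime.capi import _pybind_state as C

def maybe_register_tensorrt_plugins():
    if hasattr(C, "register_tensorrt_plugins_as_custom_ops"):
        C.register_tensorrt_plugins_as_custom_ops()
```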
TRT EP's GetTensorRTCustomOpDomainList() will create a vector of OrtCustomOpDomain objects and release the ownership of those objects, but those objects are never released afterwards. At the session level, we need to make TRT EP remember which OrtCustomOpDomain objects it created and release them at EP destruction time.
### Description
Since Cutlass can be built with CUDA 11.4 (the minimum CUDA version for the onnxruntime CUDA build), there is no need to have a flag to disable cutlass.

Changes:
(1) Reverted #18761
(2) Removed the condition to build cutlass.
(3) Fixed a few build errors or warnings found while testing the CUDA 11.4 build.

Note that SM 89 and 90 (including fp8) require CUDA 11.8 or later. Flash attention and cutlass fused multihead attention will not be built for CUDA < 11.6. It is recommended to build with CUDA 11.8 or above if you want to support the latest GPUs. It is better to include this in 1.17.0 (otherwise, the release branch might encounter a build failure with CUDA 11.4).

Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**

Example build command used on Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

sh build.sh --config Release --build_shared_lib --parallel --use_cuda --cuda_version 11.4 \
   --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
   --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
   --disable_types float8
```
### Description
Update abseil to a release tag and register neural_speed with CG.

### Motivation and Context
We are currently using a non-released version of abseil. Using a tag is better.
### Description
[ORT 1.17.0 Release] Cherry pick 1st round

PR authors, please take a look and let me know if there are any questions about the changes, or approve accordingly.

---------

Co-authored-by: wejoncy <[email protected]>
Co-authored-by: Xavier Dupré <[email protected]>
Co-authored-by: Yulong Wang <[email protected]>
Co-authored-by: Hector Li <[email protected]>
Co-authored-by: luoyu-intel <[email protected]>
Co-authored-by: kunal-vaishnavi <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
Co-authored-by: Ye Wang <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: snadampal <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Heflin Stephen Raj <[email protected]>
Co-authored-by: Yifan Li <[email protected]>
Co-authored-by: Yufeng Li <[email protected]>
Co-authored-by: Changming Sun <[email protected]>